Cell Nuclei Segmentation in Microscopic Images
CPSC 532L
Hooman Shariati 21155098
1. Introduction
The overarching goal of this project is to create a computer model that can identify and segment out a range of
nuclei across varied conditions. Identifying the cells’ nuclei is the starting point for most analyses because the
nucleus of a cell contains its full DNA. As a result, identifying nuclei allows researchers to identify each individual
cell in a sample, and by measuring how cells react to various treatments, the researcher can understand the
underlying biological processes at work. Consequently, automating the process of identifying nuclei will allow
for more efficient drug testing, which can dramatically reduce the time that it takes for new drugs to come to
market.
This task is made difficult by the diversity and limited size of the available training data. First of all, segmenting
cell nuclei by hand is time consuming and difficult to do with high precision, which limits the amount of
annotated (i.e. labeled) examples. Secondly, there are a large number of cell types, image modalities, and
magnification levels across which the proposed classifier must perform reliably. Figure 1 shows several
cell types, image modalities, and magnification levels present in the training set. Figure 2 shows the result of
applying the ground truth segmentations (provided with the training set) to highlight differences in the nuclei of
different cell types.
Figure 1
Figure 2
Specifically, in this project, nuclei of 278 different cell types must be segmented across 3 different image
modalities (Bright field, Stained/Histological, Fluorescence) and 12 different magnification levels. Given the
small size of the dataset (670 images), several cell types, imaging modalities and magnification levels are
dramatically underrepresented.
Since any single classifier will have a difficult time generalizing across such a diverse and biased dataset, two
completely different approaches are explored and compared in this project: one based on semantic
segmentation using U-Net [1], and another based on instance segmentation using Mask R-CNN [2]. This will
act as a preliminary study, enabling future work on a methodology based on an ensemble or fusion of several
different classifiers. In the following, Section 2 (Data Exploration) discusses the details of the dataset, Section 3
(Methodology) describes the approaches taken in this project, including the theory behind the U-Net and Mask
R-CNN models, Section 4 (Implementation) describes the implementation details of each model, Section 5
(Results) discusses the experimental results, Section 6 describes future work for this project, Section 7 concludes
the project, Section 8 contains the references, and finally Section 9 is the appendix.
2. Data Exploration
The data required for this project comes from the 2018 Data Science Bowl in 3 sets: one set of images along with
the segmented masks (the labeled dataset), which is used for training, and two test sets which contain just the
images and no masks, used for validation and testing. The trained models are submitted to Kaggle for validation
on the first test set, and later for testing on the second test set. Only Kaggle has access to the segmented masks
for the two test sets, ensuring that they are not used in the training process in any way.
All 3 datasets contain a large number of segmented nuclei images. The images were acquired under a variety of
conditions and vary in the cell type, magnification, and imaging modality (Bright field, Stained/Histological,
Fluorescence). The dataset is designed to challenge an algorithm's ability to generalize across these variations.
More specifically, the training set contains 670 images, where each image contains 10-200 cells of the same
type. Each image is represented by an associated ImageId. Files belonging to an image are contained in a folder
with this ImageId.
Within this folder are two subfolders:
Images directory contains the image file.
Masks directory contains the segmented masks of each nucleus. This folder is only included in the
training set. Each mask contains one nucleus. Masks are not allowed to overlap (no pixel belongs to two
masks).
Figure 3 shows a single training example (image) from the training set along with its ground truth segmented
mask. Please note that in the training dataset, we have a single mask for each cell nucleus (i.e. for each instance).
In Figure 3, I have combined all the cells into a single mask to save space; in Figure 2, you can see the masks as
they are in the dataset (one mask for each cell).
Figure 3
2.1 Clustering to detect image types
As you can see in Figure 2, there is a lot of diversity in the dataset. Some images are grayscale with a black
background and nuclei in grayscale intensities, some images are in color, and some images seem to be black
on white. As such, I performed K-Means clustering on the pixel values (converting them from RGB to HSV after
throwing away the alpha channel) to detect different types of images based on the colors present in them (i.e.
based on the dominant HSV color distributions). I ran K-Means with several numbers of clusters but got the best
clustering results using 3 clusters. This makes sense since there are 3 imaging modalities in the dataset, which
implies at least 3 clusters in the data as well.
Running the clustering algorithm proved the assumption above to be true. There are 3 clearly separated
clusters corresponding to the 3 image modalities. In the training set, I found the following distribution:
Cluster 1 Bright-field images: 2.4%
Cluster 2 Fluorescent images: 81.5%
Cluster 3 Histological images: 16.1%
Figure 4 shows some images taken using the Bright-field imaging modality (the first cluster), Figure 5 shows
some images taken using the Fluorescent imaging modality (the second cluster), and Figure 6 shows some
images taken using the Stained/Histological imaging modality (the third cluster).
Figure 4: First cluster (Bright-field images)
Figure 5: Second cluster (Fluorescent images)
Figure 6: Third Cluster (Histological images)
2.2 Distribution over image sizes
In addition to different image modalities and magnification levels, the images in our datasets have various
shapes too. Figure 7 shows the distribution over image sizes for both the training set and the test set.
Figure 7
As you can see above, not only is the distribution skewed, but there are also some image sizes in the test set that
are not in the training set. These could be of a different modality or magnification. 7 out of 16 different image
shapes appear in the test set only and not in the training set. This implies that the final models must be able to
generalize across image sizes.
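This check amounts to a set difference over the image dimensions found in each split; the shape lists below are illustrative placeholders, not the actual dataset inventory:

```python
from collections import Counter

# Hypothetical (height, width) inventories; the real ones come from the image files.
train_shapes = [(256, 256), (256, 256), (520, 696), (360, 360)]
test_shapes = [(256, 256), (512, 680), (519, 253)]

train_dist = Counter(train_shapes)                # (skewed) distribution over shapes
test_only = set(test_shapes) - set(train_shapes)  # shapes unseen during training
print(test_only)
```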
3. Methodology
As far as the competition is concerned, the output of the overall system should be a binary mask where the
pixels belonging to cell nuclei are represented by 1 and background pixels are represented by 0.
Consequently, at first glance, this might seem like a semantic segmentation problem (i.e. classifying each pixel as
either background or a nucleus).
However, a valid submission requires that no two predicted masks for the same image are overlapping. In other
words, as shown in figure 8, for two overlapping nuclei two individual masks must be returned. This implies that
performing instance segmentation (i.e. multi-class classification of pixels) might outperform the models based
on semantic segmentation in this competition (i.e. based on the evaluation metric specified below).
Figure 8
The following are the results of preliminary comparison between semantic segmentation using U-Net, multi-
class classification of each pixel using improved U-Net, and instance segmentation using Mask-R-CNN on a small
portion of the training set (approximately 10% of examples selected at random). The evaluation metric is the
mean average precision at different intersection over union (IoU) thresholds, as described in the next section.
U-Net (single class, i.e. semantic segmentation) : 0.35
Enhanced U-Net (multi-class with post processing) : 0.42
Mask-RCNN (instance segmentation) : 0.50
In this project, two models are trained and compared in terms of their performance on this task: basic U-Net
(semantic segmentation) and Mask R-CNN (instance segmentation). The problem is posed in two different
frameworks (instance segmentation vs. semantic segmentation) in order to gauge the suitability of each
approach for the task of segmenting cell nuclei without any overlap and without differentiating between
different instances.
In addition, in practice, multiple models might be used for this task in an ensemble method (such as non-max
suppression) in order to take advantage of the independent errors made by each classifier. Basically, for nuclei
that are not detected by Mask R-CNN, we can fall back on semantic segmentation using U-Net. Also, since the
results from different classifiers are cumulative, one can always convert an ensemble system to a single network
using knowledge distillation (teacher-student learning) for practical applications.
3.1 Evaluation Metric
Experimental results are obtained by measuring the performance of the algorithm on the two test sets held by
Kaggle. The prediction results of the algorithm on test sets are uploaded to Kaggle which compares them to the
ground truth (segmented masks) available only to Kaggle and publishes the results.
The models' performance is evaluated on the mean average precision at different intersection over union (IoU)
thresholds. The IoU of a proposed set of object pixels A and a set of true object pixels B is calculated as:

IoU(A, B) = |A ∩ B| / |A ∪ B|
The metric sweeps over a range of IoU thresholds, at each point calculating an average precision value. The
threshold values range from 0.5 to 0.95 with a step size of 0.05: (0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9,
0.95). In other words, at a threshold of 0.5, a predicted object is considered a hit if its intersection over union
with a ground truth object is greater than 0.5.
At each threshold value t, a precision value is calculated based on the number of true positives (TP), false
negatives (FN), and false positives (FP) resulting from comparing the predicted objects to all ground truth objects:

Precision(t) = TP(t) / (TP(t) + FP(t) + FN(t))
A true positive is counted when a single predicted object matches a ground truth object with an IoU above the
threshold. A false positive indicates a predicted object had no associated ground truth object. A false negative
indicates a ground truth object had no associated predicted object. The average precision of a single image is
then calculated as the mean of the above precision values over all IoU thresholds:

AP = (1 / |T|) * Σ_{t ∈ T} TP(t) / (TP(t) + FP(t) + FN(t)),  where T = {0.5, 0.55, ..., 0.95}
Finally, the score returned by Kaggle is the mean taken over the individual average precisions of each image in
the test dataset.
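The metric above can be sketched in NumPy as follows, taking labeled masks (background 0, each instance a distinct integer) for one image. This is a simplified re-implementation for illustration, not the official scoring code:

```python
import numpy as np

def average_precision(y_true, y_pred, thresholds=np.arange(0.5, 1.0, 0.05)):
    """Mean of precision(t) = TP / (TP + FP + FN) over the IoU thresholds."""
    true_ids = [i for i in np.unique(y_true) if i != 0]
    pred_ids = [i for i in np.unique(y_pred) if i != 0]
    # Pairwise IoU between every ground truth and predicted instance.
    iou = np.zeros((len(true_ids), len(pred_ids)))
    for a, t in enumerate(true_ids):
        for b, p in enumerate(pred_ids):
            inter = np.sum((y_true == t) & (y_pred == p))
            union = np.sum((y_true == t) | (y_pred == p))
            iou[a, b] = inter / union
    precisions = []
    for t in thresholds:
        matches = iou > t                              # a hit needs IoU above t
        tp = np.sum(matches.any(axis=1))               # matched ground truths
        fn = len(true_ids) - tp                        # unmatched ground truths
        fp = len(pred_ids) - np.sum(matches.any(axis=0))  # unmatched predictions
        precisions.append(tp / (tp + fp + fn))
    return float(np.mean(precisions))

# A perfect prediction on a toy labeled image scores 1.0.
truth = np.array([[1, 1, 0], [0, 0, 2]])
print(average_precision(truth, truth))  # 1.0
```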
3.2 Semantic Segmentation using U-Net
U-Net [1] is used in this project in order to provide a baseline performance using a relatively simple model. In
addition, it provides the means for comparing semantic segmentation in contrast with instance segmentation, as
applied to this particular task. U-Net was chosen among several semantic segmentation methods because it has
won the Grand Challenge for Computer-Automated Detection of Caries in Bitewing Radiography at ISBI 2015,
and it has won the Cell Tracking Challenge at ISBI 2015 on the two most challenging transmitted light microscopy
categories (Phase contrast and DIC microscopy) by a large margin. The task and the datasets evaluated in these
competitions are very similar to the task of segmenting cell nuclei and our given dataset which means U-Net has
a good chance of performing well in this application.
Similar to how Mask-RCNN is an extension of the R-CNN line of models for instance segmentation, U-Net can be
thought of as an extension to the FCN (Fully Convolutional Network) line of models for semantic segmentation.
Since FCNs do not use any fully connected layers, they have been heavily used for dense predictions because they are
much faster compared to the patch classification approach. FCNs allow for segmentation maps to be generated
for images of any size.
However, aside from the computational expense of fully connected layers (which FCNs aim to overcome), CNNs
have a problem with pooling layers. Pooling layers increase the field of view and are able to aggregate the
context while discarding the spatial information. However, semantic segmentation requires the exact alignment
of class maps and thus, needs the spatial information to be preserved. U-Net tries to solve this problem by
utilizing encoder-decoder architecture. The general idea is that the encoder gradually reduces the spatial
dimension with pooling layers while the decoder gradually recovers the object details and spatial dimensions.
Another method for solving this problem would be using dilated/atrous convolutions to replace pooling layers
altogether.
The network architecture of basic U-Net is shown in Figure 9. It consists of a contracting path (left side) and an
expansive path (right side). The contracting path follows the typical architecture of a convolutional network,
down-sampling the input using 3*3 convolutional layers (with no padding) while increasing its depth (number of
filters), each followed by a ReLU and a max pooling operation with a stride of 2 (to cut the image dimensions in
half). The expansive path reduces the number of filters (again using convolutions) while increasing input dimensions
by concatenating each feature map with the correspondingly cropped feature map from the contracting path.
The cropping is necessary due to the loss of border pixels in convolutions. Finally, at the last layer, a 1*1
convolution is used to map each one of the 64-dimensional feature vectors to the desired number of classes.
Figure 9
As you can see in Figure 9, the architecture of U-Net consists of a few (2 to 4 in my experiments) encoding and
the same number of decoding layers. The encoding layers are used to extract different levels of contextual
feature maps. The decoding layers are designed to combine these feature maps produced by the encoding
layers to generate the desired segmentation maps. A larger number of encoder/decoder stages with larger sizes
has been shown to lead to better performance at the cost of higher computational requirements.
3.3 Instance Segmentation using Mask R-CNN
Figure 10 below shows the typical pipeline for instance segmentation. As you can see, instance segmentation
combines elements from the classical computer vision tasks of object detection, where the goal is to classify
individual objects and localize each using a bounding box, and semantic segmentation, where the goal is to
classify each pixel into a fixed set of categories without differentiating object instances. Instance segmentation is
challenging because it requires the correct detection of all objects in an image while also precisely segmenting
each instance.
Figure 10
Mask R-CNN [2] is the next extension in the R-CNN line of models. It extends Faster R-CNN [3] (the latest R-CNN
model) in two ways: 1. by adding a third branch for predicting segmentation masks on each Region of Interest
(RoI), in parallel with the existing branches for classification and bounding box regression; 2. by adding a newly
designed layer called RoIAlign, which fixes the misalignment between extracted features and the input caused by
the quantization performed by the RoIPool layer used in Faster R-CNN [4].
Figure 11 shows the evolution of the R-CNN line of architectures from the original R-CNN to Mask R-CNN [2]. As
you can see, Mask R-CNN is a modular composition of several recent ideas and has the following major
components, which are all individually modifiable for the specific task:
1. Backbone architecture
2. Feature Pyramid Network (FPN)
3. Region Proposal Network (RPN)
4. Region of interest feature alignment (RoIAlign)
5. Multi-task network head:
Box classifier
Box regressor
Mask predictor
Keypoint predictor
In this project, Mask R-CNN is used instead of Faster R-CNN because of its superior performance and the fact
that it adds only a small overhead to Faster R-CNN, running at 5 fps.
Figure 11
Please look at the appendix for 3.3.1 Faster R-CNN Review and 3.3.2 Mask R-CNN additions to Faster R-
CNN.
4. Implementation
4.1 U-Net
The U-Net for this project is implemented using Keras with a TensorFlow backend. You can find the code in the
Jupyter notebook contained at https://github.com/hooman67/Cell_Nuclei_Segmentation.
4.1.1 Pre-processing
As mentioned in the Data section, this dataset contains images of 5 different dimensions: (128, 128), (256, 256),
(360, 360), (520, 696), and (1024, 1024). Since U-Net (like most other models) requires images to be of a fixed
canonical size, after loading the images and masks they are all down-sampled to 128*128. Working with smaller
images makes computations much faster, at the cost of losing information during down-sampling.
Specifically in this task, one concern is that some cell nuclei are very small even in their original image size. After
down sampling, these nuclei might become smaller than the detection threshold of U-Net. Note that since 128
by 128 is the smallest dimension in our dataset, using this as the canonical size makes sense because it avoids
having to up sample any of the images (by interpolation for example). However, it should be noted that for the
purposes of the competition, Kaggle expects all masks to be in the original image size. Consequently, after
predicting a 128*128 mask, it should be up sampled to the original size in order to be run-length encoded and
submitted to Kaggle for validation. For this task, I used the resize function contained in scikit-image, with an
order 1 spline interpolation.
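The run-length encoding that Kaggle expects (column-major scan, 1-indexed start positions) can be sketched as follows; this is an illustrative re-implementation, not the exact submission code:

```python
import numpy as np

def rle_encode(mask):
    """Run-length encode a binary mask as pairs of (start position, run length),
    scanning top-to-bottom then left-to-right (column-major), 1-indexed."""
    pixels = mask.flatten(order="F")
    # Pad with zeros so every run has a well-defined start and end.
    padded = np.concatenate([[0], pixels, [0]])
    changes = np.where(padded[1:] != padded[:-1])[0] + 1
    starts, ends = changes[::2], changes[1::2]
    return " ".join(f"{s} {e - s}" for s, e in zip(starts, ends))

mask = np.array([[0, 1, 1],
                 [0, 1, 0]], dtype=np.uint8)
# Column-major flattening gives [0, 0, 1, 1, 1, 0]: one run starting at 3, length 3.
print(rle_encode(mask))  # "3 3"
```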
In addition, as mentioned in the Methodology section, since U-Net performs semantic segmentation, it requires
the ground truth masks in a single image. In our dataset, for each image, we have a variable number of masks
(one for each cell nucleus) of the same size as the image. Consequently, after resizing all images and masks to a
canonical size, I combine all the masks available for each image (i.e. all the different nuclei in an image) into
a single mask. Since the masks are binary, this is done through a max operation that replaces each pixel with the
maximum value across all the masks. This is possible because the annotated masks are guaranteed not to overlap.
4.1.2 Data Augmentation
I experimented with various data augmentations, but they generally did not improve the performance of the
model. I tried rotations with an angle of 45 degrees, height and width shifts with a range of 0.1, shear
deformation with a range of 0.2, different zoom levels with a range of 0.2, and both constant and reflect fill
modes. Applying these augmentations reduced the performance of the model in comparison with an identically
trained model without data augmentations. The only augmentations that had a small positive effect on the
performance of the model were horizontal and vertical flips.
4.1.3 Network Architecture
The architecture of the original U-Net is shown in Figure 9. The network built in this project is loosely based on
the original paper. In addition to modifying the network to accommodate images of different size than the ones
in the paper, I have made the following modifications:
1. The original U-Net paper does not include any dropout layers. In order to reduce overfitting, I have added a
dropout layer with a probability of 0.1 to each stage. Due to the relatively small size of the training data, it would
be easy to overfit a complex model; dropout acts as a form of regularization for the network.
2. The original paper uses ReLU activations throughout the network. However, there is research showing that
Exponential Linear Units (ELUs) outperformed ReLUs on object classification and localization tasks on ImageNet
using various CNNs. Figure 12 shows the mathematical equation for ELUs and a plot comparing ELUs with
ReLUs and Leaky ReLUs (LReLUs).
Figure 12
Like ReLUs, leaky ReLUs (LReLUs) and parameterized ReLUs (PReLUs), ELUs also avoid a vanishing gradient via
the identity for positive values. However, ELUs have improved learning characteristics compared to the other
activation functions. In contrast to ReLUs, ELUs have negative values which allow them to push mean unit
activations closer to zero. Zero means speed up learning because they bring the gradient closer to the unit
natural gradient. Like batch normalization, ELUs push the mean towards zero, but with a significantly smaller
computational footprint. While other activation functions like LReLUs and PReLUs also have negative values,
they do not ensure a noise-robust deactivation state. ELUs saturate to a negative value with smaller inputs and
thereby decrease the propagated variation and information. Therefore ELUs code the degree of presence of
particular phenomena in the input, while they do not quantitatively model the degree of their absence.
Consequently dependencies between ELU units are much easier to model and distinct concepts are less likely to
interfere.
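For reference, the ELU activation (with the usual α = 1) can be written as:

```python
import numpy as np

def elu(x, alpha=1.0):
    """ELU: identity for positive inputs, alpha * (exp(x) - 1) for negative ones,
    saturating to -alpha for large negative inputs."""
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

x = np.array([-10.0, -1.0, 0.0, 2.0])
print(elu(x))  # approaches -1 for very negative inputs; identity for positives
```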
Since U-Net is based on an encoder-decoder design, its architecture depends on the input size. This is because the
original image is first converted to a deep 8*8 feature map through a series of convolutional and max
pooling layers, and is then converted back to the original size through a series of transpose convolutions.
Consequently, to investigate the effects of different canonical sizes, I had to implement two different networks:
one with 9 stages for 128*128 images, and another with 13 stages for 512*512 images. You can find the results
of this investigation in the Results section.
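The stage counts quoted above follow directly from halving the spatial size until the 8*8 bottleneck; a quick sanity check:

```python
def unet_stage_count(input_size, bottleneck=8):
    """Count encoder stages (one per halving), the bottleneck, and the matching
    decoder stages for a square input of the given size."""
    halvings = 0
    size = input_size
    while size > bottleneck:
        size //= 2
        halvings += 1
    return 2 * halvings + 1  # encoder stages + bottleneck + decoder stages

print(unet_stage_count(128))  # 4 halvings down + bottleneck + 4 up = 9
print(unet_stage_count(512))  # 6 halvings down + bottleneck + 6 up = 13
```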
Figure 13 shows the model optimized for 128*128 images (using 9 stages to take the input from 128*128 to
8*8 and back to 128*128) while Figure 18 shows the model optimized for 512*512 images (using 13 stages to
take the input from 512*512 to 8*8 and back to 512*512). Both models use the Adam optimizer with the binary
cross-entropy loss. The evaluation metric must be the mean average precision at different intersection over
union (IoU) thresholds (10 values from 0.5 to 0.95). TensorFlow has a mean IoU implemented in the
tf.metrics.mean_iou class, but has no method for averaging over multiple thresholds. As a result, I
implemented my own method (which can be found in the U-Net notebook in the repo).
Figure 13
Figure 18
4.1.4 Training
For the choice of optimizer, I used Adam with the binary cross-entropy loss. The binary loss is appropriate here
since the segmentation masks are binary. For the evaluation metric, I am using the average over mean IoUs as
described above. For the beta_1 and beta_2 parameters of the Adam optimizer, I am using the default values of
0.9 and 0.999, respectively.
The results presented here were obtained by starting from a learning rate of 0.001 and training the model for
50 epochs. However, I used an early-stopping callback that ends training when the validation loss does not
improve for more than 5 epochs. I use this because at this point, I have not applied any regularization to the
model (no L2 or L1). Furthermore, I am using no weight decay in the Adam optimizer either (i.e. the value of
weight decay was set to 0). With the early stopper and a learning rate of 0.001, the model ends training at
around 25 epochs. In addition, 10% of the training set was chosen at random and used for validation. The
validation loss was calculated five times at every epoch.
As future work, I plan to train the model more thoroughly by baby-sitting the learning rate, reducing it every
time the early stopper ends the training process prematurely. When/if I reach the point where the model is
overfitting, I might add L1/L2 regularization or weight decay to compensate. Right now, the only mechanisms
by which I battle overfitting are early stopping and dropout layers in between all convolutional and transpose
convolutional layers.
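The early-stopping behavior used here (stop when validation loss has not improved for 5 consecutive epochs) can be sketched independently of Keras as:

```python
def train_with_early_stopping(val_losses, patience=5):
    """Return the epoch at which training stops, given a per-epoch validation
    loss history (a stand-in for the actual training loop)."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch  # stop: no improvement for `patience` epochs
    return len(val_losses) - 1

# Loss improves for 20 epochs, then plateaus; training stops 5 epochs later.
history = [1.0 / (e + 1) for e in range(20)] + [0.05] * 30
print(train_with_early_stopping(history))
```

In Keras this corresponds to the EarlyStopping callback monitoring `val_loss` with `patience=5`.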
4.2 Mask R-CNN
The implementation of Mask R-CNN for this project is based on Keras with a TensorFlow backend and is largely
influenced by the Matterport implementation of Mask R-CNN [2]. The code for this project can be found in the
GitHub repo. I tried to follow the Mask R-CNN paper's recommendations for the most part, but I made the
following changes to make the model more suitable for this particular task. For the backbone architecture, I am
using ResNet101.
4.2.1 Image Resizing
As mentioned in the U-Net implementation section, we have multiple image dimensions in our dataset. To have
a canonical size for the network (and to be able to use multiple images per batch), I resize all images to 512 by
512. For images that have different aspect ratios (i.e. are not square), I pad them with zeros. Note that, in the
original paper, the input images are not resized; only the masks are resized to a small fixed size. Here, I am
resizing the images as well, for consistency and to make training faster.
However, I do resize the masks to a fixed size by extracting the bounding box of the object and resizing it (just
the object) to a fixed size of 56 by 56. Since our dataset provides only the masks and no bounding boxes, I
generate my own. In doing so, I pick the smallest box that encapsulates all the pixels of the mask as the
bounding box. This simplifies the implementation and also makes it easy to apply certain image augmentations
that would otherwise be really hard to apply to bounding boxes, such as image rotation. Nonetheless, in terms
of data augmentation, I am only using horizontal flips. This is because after trying various crops, rotations,
translations, and rescaling, I noticed that they do not improve the results. For the ROI pooling layer, I use the
same parameters as suggested by the paper: a pool size of 7 for ROIs and a pool size of 14 for the masks.
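The bounding-box extraction and mask resizing described above can be sketched as follows; nearest-neighbor resizing is used here for simplicity, and the helper names are illustrative:

```python
import numpy as np

def mask_to_bbox(mask):
    """Smallest box (y1, x1, y2, x2), exclusive ends, covering all mask pixels."""
    ys, xs = np.where(mask > 0)
    return ys.min(), xs.min(), ys.max() + 1, xs.max() + 1

def crop_and_resize(mask, out_size=56):
    """Crop the mask to its bounding box and resize to out_size*out_size."""
    y1, x1, y2, x2 = mask_to_bbox(mask)
    crop = mask[y1:y2, x1:x2]
    # Nearest-neighbor resize via index mapping.
    rows = np.arange(out_size) * crop.shape[0] // out_size
    cols = np.arange(out_size) * crop.shape[1] // out_size
    return crop[np.ix_(rows, cols)]

mask = np.zeros((100, 100), dtype=np.uint8)
mask[10:20, 30:45] = 1  # a 10*15 nucleus
small = crop_and_resize(mask)
print(small.shape)  # (56, 56)
```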
4.2.2 Anchors in the Region Proposal Network (RPN)
The original paper suggests Anchor scales of 32, 64, 128, 256, and 512. These are the lengths of square anchor
side in pixels. However, since cell nuclei are much smaller than the objects that the Mask R-CNN in the original
paper was trying to detect, in this project I am using anchor scales of 4, 8, 16, 128, and 256 instead. This is
because in a preliminary study I realized that most of the nuclei are in the 8*8 pixel range, with some as big as
100*100 pixels. This is why I keep the smallest scales (4, 8, and 16) but then jump to 128*128
pixels. In other words, there are very few nuclei in our dataset that are bigger than 16 by 16 and smaller than
128 by 128 pixels.
I use the same aspect ratios for anchors as suggested in the paper (0.5, 1, and 2). For the anchor stride, I use a
value of one, which produces 1 anchor for every position (i.e. pixel) in the backbone feature map. For the
strides of the FPN pyramid, I use the same values suggested in the paper (4, 8, 16, 32, 64) since these values
were optimized for the ResNet101 backbone architecture that I am using. Using the above values, the lowest
level of the pyramid has a stride of 4px relative to the image, so anchors are created at every 4-pixel interval.
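A sketch of how anchors can be generated at one pyramid level (scales and ratios as above); this is simplified relative to the actual RPN code:

```python
import numpy as np

def generate_anchors(scale, ratios, feature_shape, feature_stride, anchor_stride=1):
    """Centered (y1, x1, y2, x2) boxes for one scale at every feature-map position."""
    ratios = np.asarray(ratios)
    heights = scale / np.sqrt(ratios)  # ratio = width / height
    widths = scale * np.sqrt(ratios)
    # Anchor centers in image coordinates.
    ys = np.arange(0, feature_shape[0], anchor_stride) * feature_stride
    xs = np.arange(0, feature_shape[1], anchor_stride) * feature_stride
    cy, cx = np.meshgrid(ys, xs, indexing="ij")
    boxes = []
    for h, w in zip(heights, widths):
        boxes.append(np.stack([cy - h / 2, cx - w / 2, cy + h / 2, cx + w / 2], axis=-1))
    return np.concatenate([b.reshape(-1, 4) for b in boxes])

# Smallest scale (4 px) on a 32*32 feature map with stride 4:
anchors = generate_anchors(4, [0.5, 1, 2], (32, 32), 4)
print(anchors.shape)  # 3 ratios * 32 * 32 positions = (3072, 4)
```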
For the threshold of the non-max suppression stage, the original paper suggests 0.3 so that it can generate a
large number of proposals. However, I realized that a higher value of 0.7 or 0.8 can increase the performance on
the test set by imposing stricter criteria for proposals. For training, I use 256 anchors per image.
4.2.3 Training and Testing Regions of Interest (ROIs)
After non-max suppression, I keep 2000 ROIs for training and 1000 ROIs during test time. The original
paper generates 1000 ROIs per image (initially), which I have reduced to 600 here. This is because the paper
mentions that for the best results the sampling stage should pick up approximately 33% positive ROIs. Since my
images are smaller and have fewer objects compared to the task for which Mask R-CNN was used in the original
paper, I generate fewer ROIs per image initially, so that after down sampling I am left with 33% positive ROIs.
For the number of ROIs per image to feed to the classifier and mask heads, the original paper suggests 512.
However, I am using 200. Again, this is because when using 512, RPN generates too many positive proposals. So,
I am generating fewer proposals to begin with in order to keep the positive to negative ROI ratio at 1/3 as per
the paper’s suggestion. The criteria for deciding between positive and negative ROIs are the same described in
the Methodology section (and in the original paper). Also, for detection, I have set the minimum probability
value to accept a detected instance to 0.7. Basically, ROIs with confidence levels below 0.7 are ignored during
test time.
In the original Mask R-CNN paper, the maximum number of instances to use in training and to return in testing
must be specified. The original paper uses a value of 100 instances per image in both; meaning regardless of
how many instances there are in the image, we return only 100 objects. However, in this project, some images
have up to 500 nuclei in them. Unfortunately, I cannot set the maximum number of ground truth instances to
512 during training due to memory limits. Consequently, I use a maximum of 256 ground truth instances for
training (i.e. allowable objects in each image during training). For test time, however, I have set this number to
512 so that I can correctly classify images with a large number of cell nuclei in them.
Finally, for the mean RGB pixel values used to normalize the input images, I use the same values as suggested in
the paper (123.7, 116.8, 103.9). At a future time, I might change these values to better reflect the colors present
in our dataset (based on a more sophisticated clustering algorithm than the one I used in the data section). I am
also using the same values as suggested in the paper for the standard deviations used in the mask refinement
stage in both the RPN stage and the final refinement stage. For both the RPN bounding box refinement stage
and the final refinement stage these were suggested to be set at 0.1, 0.1, 0.2, and 0.2.
4.2.4 Training
Faster R-CNN was trained using a multi-step approach, training parts independently and merging the trained
weights before a final full training approach. However, the Mask R-CNN paper recommends an end-to-end
training strategy (i.e. joint training). In this project, after putting the complete model together, there are 4
different losses, two for the RPN and two for R-CNN. For this task, aside from training the layers in the RPN and
in the R-CNN, I have decided to fine-tune the ResNet101 backbone structure as well; starting from the weights
trained on the COCO dataset.
This is because the objects that the backbone model (i.e. ResNet101) was trained on (i.e. the images in the
COCO dataset) are very different from the objects we are trying to detect (i.e. Cells and their nuclei). Training
the ResNet101 architecture is very computationally expensive; however, it is required in order to squeeze out all
the possible performance we can get from ResNet101. Consequently, starting from the COCO weights, I fine-
tune the ResNet101 backbone structure for 50 epochs, and then I freeze the backbone weights and train only
the heads (i.e. RPN and R-CNN) for 300 epochs.
During the end-to-end training, I combine the four different losses using a weighted sum, assigning the
classification losses a higher weight relative to the regression ones, and the R-CNN losses a higher weight than
the RPN losses. Apart from the regular losses, I also have the regularization losses, which are defined both in the
RPN and in the R-CNN. I use L2 regularization in all of the layers in the RPN and in the R-CNN, but no regularization in the
backbone (ResNet101) network since the original training (on the COCO dataset) was performed without any
regularization.
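As a concrete sketch, the weighted combination can be written as below; the weight values here are illustrative assumptions that follow the ordering described in the text, not the exact values used in this project.

```python
def total_loss(rpn_class, rpn_bbox, rcnn_class, rcnn_bbox, l2_penalty,
               weights=None):
    """Weighted sum of the four Mask R-CNN losses plus L2 regularization.

    The default weights are hypothetical: classification losses are
    weighted above regression losses, and R-CNN losses above RPN losses.
    """
    if weights is None:
        weights = {"rpn_class": 1.0, "rpn_bbox": 0.5,
                   "rcnn_class": 2.0, "rcnn_bbox": 1.0}
    return (weights["rpn_class"] * rpn_class
            + weights["rpn_bbox"] * rpn_bbox
            + weights["rcnn_class"] * rcnn_class
            + weights["rcnn_bbox"] * rcnn_bbox
            + l2_penalty)
```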
For the choice of optimizer, I use Stochastic Gradient Descent with the default momentum value of 0.9. The Mask R-CNN paper uses a learning rate of 0.02, but I found that to be too high: it often causes the weights to explode, especially when using a small batch size. This might be related to differences between how Caffe and TensorFlow compute gradients (sum vs. mean across batches and GPUs), or perhaps the official model uses gradient clipping to avoid this issue. I do use gradient clipping, but do not set it too aggressively. In terms of batch sizes, since I have images as large as 1024 by 1024, I can only fit 2 images on the GPU (which has 12GB) when training the full end-to-end model (i.e. including the ResNet101). When training just the heads (i.e. the RPN and the R-CNN) I use a batch size of 8. If I change the backbone architecture to ResNet50, I can use a batch size as large as 16. However, I found that using ResNet50 instead of ResNet101 has a relatively large effect on the accuracy, so I decided that the sacrifice in accuracy would not be worth the reduced training time.
Consequently, when training the entire network end-to-end (i.e. fine-tuning the backbone ResNet101 as well as the RPN and R-CNN heads), I start with a learning rate of 0.001 but set an early-stopping callback to end the training if the validation loss does not improve within 5 epochs. Every time the model early-stops, I divide the starting learning rate by 10 and resume the training. According to the documentation, the SGD optimizer automatically decreases the learning rate by a factor of 10 every 50,000 steps, but I found that not to be enough. When training just the heads, I use a learning rate of 0.0001 and did not have to decrease it (i.e. the training ends when the network reaches the specified number of epochs, 300, and not due to early stopping). As mentioned previously, I also use regularization, with a fixed regularization parameter of 0.0001. Finally, 10% of the training set was chosen at random and used for validation. I perform 5 validation steps per epoch.
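The schedule described above can be sketched as two small helpers; the patience of 5 epochs and the divide-by-10 rule follow the text, while the function names are my own.

```python
def should_stop(val_losses, patience=5):
    """Early-stopping rule: stop once the validation loss has not
    improved within the last `patience` epochs."""
    if len(val_losses) <= patience:
        return False
    best_before = min(val_losses[:-patience])
    # No epoch in the last `patience` beat the earlier best: stop.
    return min(val_losses[-patience:]) >= best_before

def next_learning_rate(lr):
    """On each early stop, the starting learning rate is divided by 10
    before training resumes from the current weights."""
    return lr / 10.0
```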
5. Discussion of Results and Related Works
Table 1 shows the results of the experiments performed using U-Net. All of the reported scores (both here and everywhere else in this report, unless explicitly mentioned otherwise) are the mean average precision at different intersection over union (IoU) thresholds, as described in the Evaluation section. This is the score on the test set calculated by Kaggle. I do not have access to this test set, which guarantees an unbiased approximation of the classifier's performance.
Model and training description | Score
U-Net, 128*128 image sizes, 3 stages, early stopped at 25 epochs, no data augmentation | 0.245
U-Net, 128*128 image sizes, 3 stages, early stopped at 25 epochs, with data augmentation | 0.243
U-Net, 512*512 image sizes, 3 stages, early stopped at 25 epochs, no data augmentation | 0.230
U-Net, 512*512 image sizes, 5 stages, early stopped at 25 epochs, no data augmentation | 0.262
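For reference, the metric behind all reported scores can be sketched as follows; `ious[i][j]` is assumed to hold the IoU between predicted instance i and ground-truth instance j, and the greedy matching is a simplification of the official scorer.

```python
def kaggle_map(ious, thresholds=None):
    """Mean average precision over IoU thresholds: at each threshold t,
    a predicted instance is a true positive if it matches a ground-truth
    instance with IoU > t, and precision is TP / (TP + FP + FN),
    averaged over the thresholds."""
    if thresholds is None:
        thresholds = [0.5 + 0.05 * k for k in range(10)]  # 0.5 .. 0.95
    n_pred = len(ious)
    n_gt = len(ious[0]) if ious else 0
    scores = []
    for t in thresholds:
        matched_gt = set()
        tp = 0
        for i in range(n_pred):
            # Greedily match each prediction to the best unmatched GT.
            best_j, best_iou = -1, t
            for j in range(n_gt):
                if j not in matched_gt and ious[i][j] > best_iou:
                    best_j, best_iou = j, ious[i][j]
            if best_j >= 0:
                matched_gt.add(best_j)
                tp += 1
        fp = n_pred - tp
        fn = n_gt - tp
        scores.append(tp / (tp + fp + fn) if (tp + fp + fn) else 1.0)
    return sum(scores) / len(scores)
```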
The first two rows of Table 1 show the results of training a U-Net with 3 encoder/decoder stages (i.e. the structure shown in Figure 17) for 25 epochs. The number of training epochs was set to 50 but, in both cases, the early-stopping callback terminated the training at 25 epochs (after 5 epochs in which the validation loss did not decrease). The input images are all resized to 128*128 as described in the implementation section. The optimizer was Adam with the default learning rate, Beta 1, and Beta 2 parameters, and a decay rate of 0.
The only difference between the first two rows of Table 1 is that the model in the first row did not include any data augmentation, while the model in the second row used all of the data augmentations mentioned in the implementation section. As you can see, data augmentation caused the score to go down by 0.002 on the test set (as calculated by Kaggle), which is in line with observations made in other works on the particular task of cell nuclei segmentation.
In 2018, Cui et al. [6] looked at several data augmentation techniques proposed for the task of cell nuclei segmentation and concluded that none of them significantly improve the performance. The histological/stained images in our dataset are acquired using H&E staining, in which the nuclei of cells are stained blue by Haematoxylin while the cytoplasm is coloured pink by Eosin. Since in practice the color of H&E stained images can vary a lot due to variation in the H&E reagents, the staining process, the scanner, and the specialist who performs the staining, several H&E stain normalization methods have been proposed to eliminate the negative interference caused by color variation. However, Cui et al. tried several of these techniques and observed no considerable difference in performance.
Another augmentation that Cui et al. tried was using non-negative matrix factorization (NMF) to convert the color space of the images in both the training and test sets to a fixed color space, usually the color space of the best stained H&E image in the training set. Again, Cui et al.'s results show that this augmentation method has little effect on the performance of models in cell nuclei segmentation tasks. Finally, Cui et al. argue that the large body of nuclei-segmentation-focused augmentation techniques based on using deconvolution algorithms to extract the H-channel from the images and use it as a learning feature actually reduces the performance of deep fully convolutional networks. In other words, according to Cui et al., deep FCNs learn better from raw RGB images than from H-channel gray-scale images (i.e. the information contained in the RGB values is actually relevant).
Aside from data augmentation techniques, several other improvements have been suggested in recent years to improve the performance of U-Nets in nuclei segmentation. Zhao and Sun [7] proposed an architecture that is very similar to U-Net (i.e. it consists of a contracting path and an expansive path) but takes advantage of inception modules and batch normalization instead of ordinary convolutional layers, which reduces the number of parameters and accelerates training without loss of accuracy. I did not try this myself since my main focus is on maximizing accuracy rather than training speed.
Finally, Pena et al. [8] proposed a method for improving the performance of U-Net in segmenting cell nuclei in images with high cell density/clutter. Segmenting individual touching cell nuclei in cluttered regions is challenging, as the feature distributions on shared borders and on the cell foreground are similar, which makes it difficult to correctly classify the pixels. To solve this, Pena et al. [8] proposed extending U-Net's binary semantic segmentation predictions to instance segmentation using 3 classes (background, foreground, and touching). To support this, Pena et al. also introduced a new multiclass weighted cross-entropy loss function that takes into account not only the cell geometry but also the class imbalance. Unfortunately, this technique is not applicable in this project, since our training and test images do not have labels that convey whether they are cluttered/densely packed or not.
Table 2 shows the results of the experiments performed using Mask R-CNN. Again, the reported score is the mean average precision at different intersection over union (IoU) thresholds.
Model and training description | Score
Mask R-CNN, ResNet50 backbone, all layers trained for 10 epochs | 0.285
Mask R-CNN, ResNet101 backbone, all layers trained for 10 epochs | 0.291
Mask R-CNN, ResNet101 backbone, tuned ROI and RPN hyper-parameters, all layers trained for 10 epochs | 0.344
Mask R-CNN, ResNet101 backbone, with cleaned-up training data, all layers trained for 10 epochs | 0.318
Mask R-CNN, ResNet101 backbone, with cleaned-up training data and tuned ROI and RPN hyper-parameters, all layers trained for 10 epochs | 0.380
The first row of Table 2 shows the result of training all the layers of Mask R-CNN for 10 epochs using a ResNet50 backbone and starting from weights pre-trained on the COCO dataset. The second row shows the results when ResNet101 is used as the backbone architecture instead of ResNet50. In both cases, the SGD optimizer was used with a learning rate of 0.00001, a momentum of 0.9, and a weight decay of 0.0007. As you can see, ResNet101 (as expected) performs better than ResNet50, at the cost of taking almost twice as long to train and test.
Further exploration of the predictions made by the Mask R-CNN model with the ResNet101 backbone (second row of Table 2) showed that several ground-truth nuclei masks for the images in the training set have holes or missing pieces. I added a simple pre-processing step using OpenCV's contour detection function to exclude any masks that contain second-level contours (i.e. a contour inside another contour). This step proved to increase the performance of the classifier, as you can see in row 4 of Table 2. Figure 19 shows a few examples of masks with holes in them, while Figure 20 shows a few masks with missing pieces.
Figure 19
Figure 20
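The cleanup step above uses OpenCV's contour hierarchy; the same "mask with a hole" test can be sketched without the OpenCV dependency by flood-filling the background from the image border: any background pixel the fill cannot reach is enclosed by foreground, i.e. a hole. The function name is my own.

```python
from collections import deque

def has_hole(mask):
    """Return True if a binary mask (list of 0/1 rows) encloses a hole,
    the same condition as a 'contour inside a contour' in OpenCV."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    queue = deque()
    # Seed the fill with every background pixel on the border.
    for y in range(h):
        for x in range(w):
            if (y in (0, h - 1) or x in (0, w - 1)) and mask[y][x] == 0:
                seen[y][x] = True
                queue.append((y, x))
    while queue:
        y, x = queue.popleft()
        for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w and not seen[ny][nx] \
                    and mask[ny][nx] == 0:
                seen[ny][nx] = True
                queue.append((ny, nx))
    # A hole is a background pixel the border fill never reached.
    return any(mask[y][x] == 0 and not seen[y][x]
               for y in range(h) for x in range(w))
```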
The nuclei shown in Figure 21 represent the average size of nuclei present in the training set. We have several cell types with nuclei much larger or much smaller than the ones shown in Figure 21. Still, it makes sense to adjust the ROI scales and aspect ratios to match the average sizes of nuclei present in our dataset, which is what I do here: it is intuitive to assume that the closer the scales of the original anchors are to the actual objects we are interested in, the better the results.
Figure 21
Consequently, as mentioned in the implementation section, since most cell nuclei are much smaller than the objects the original Mask R-CNN paper was interested in detecting, I have decreased the scales of my initial anchors by a factor of 4. Figures 22 and 23 show the highest scoring anchors (i.e. the most positive anchors based on their objectness score) proposed by the RPN before the refinement stage. Figure 23 shows these anchors at the scale suggested by the Mask R-CNN paper, while Figure 22 shows the anchors after the scales are reduced by a factor of 4. As you can see, the proposals in Figure 22 are much closer to the actual sizes of the objects we are interested in (i.e. cell nuclei).
Figure 22
Figure 23
One of my biggest concerns was that reducing the scale of anchors to make them closer in size to the size of cell
nuclei would reduce the performance on images that contain cell nuclei that are much larger than the average
case. Figure 24 shows one of the cell types in our training set that has very large nuclei. As you can see, the top
scoring anchors proposed by RPN before refinement are generally smaller than the cell nuclei. However, as you
can see in Figure 25, the bounding box refinement stage of RPN seems to be able to take care of these
differences in scale. Figure 25 shows the proposed bounding boxes before the refinement in dotted lines and
after refinement in solid lines.
Figure 24
Figure 25
Moreover, as you can see in Table 2, the experimental results support the theory that the RPN refinement stage can compensate for anchors that are too small. The score of the model with the original anchor scales was 0.285 (first row of Table 2), whereas with anchor scales that are smaller by a factor of 4, the same model receives a score of 0.344 (third row of Table 2). Finally, the last row of Table 2 shows the score for the same model trained on a cleaned-up training set (i.e. one that excludes masks with holes and missing pieces) using anchors with reduced scales. As you can see, this model receives a score of 0.380, which is much higher than the score of 0.285 achieved by the original model. Figure 26 shows an example of a segmentation made by this model.
Figure 26
6. Future Works
The scope of future work for this project can be divided into 3 broad categories: improving the performance of U-Net, improving the performance of Mask R-CNN, and combining the results from several classifiers in an ensemble in order to achieve competitive results.
6.1 Improving U-Net
As mentioned in the implementation section, I would like to train the U-Net model over a larger number of epochs. Currently, the early-stopping callback, which ends the training process prematurely once the validation loss does not decrease within five epochs, causes the algorithm to stop after 25 epochs. I would like to avoid this (and thus train the model for more epochs) by increasing the decay rate of the learning rate used by the Adam optimizer. The documentation for the Adam optimizer mentions that it automatically controls the learning rate during training, so it is not necessary to manually set the momentum and decay, but in my experience a decay value above 0 achieves much better results.
Furthermore, since the original U-Net paper was published in 2015, several improvements have been suggested to make U-Net more suitable for the task of segmenting cell nuclei. Most recently (i.e. in 2018), Cui et al. [6] suggested several modifications to the architecture of U-Net to improve its performance on nuclei segmentation of histopathological images. They propose a new method for initializing the encoder/decoder weights using Glorot Uniform, and propose the use of Scaled Exponential Linear Units (SELUs) instead of ReLU activations. SELUs are designed to give self-normalizing capability to feed-forward neural networks (FNNs), and Cui et al. [6] show that FNNs using SELUs outperform ones using explicit normalization methods such as batch normalization, layer normalization, and weight normalization.
In addition to proposing an overlapped patch extraction and assembling method designed for seamless prediction of nuclei in large whole-slide images, Cui et al. propose a nuclei-boundary model that explicitly detects nuclei and their boundaries simultaneously from histopathology images. This is done in order to solve the case of overlapping nuclei that was mentioned at the beginning of the Methods section. The paper shows that detecting boundaries improves the accuracy of nuclei detection and helps split touching and overlapping nuclei. As future work for the U-Net portion of this project, I would like to experiment with the effects of the above suggested improvements.
6.2 Improving Mask R-CNN
To improve the results produced by Mask R-CNN, I am planning to experiment with different values for the hyper-parameters controlling the sizes and locations of RPN anchors. In addition, I would like to experiment with different parameters for the ROIs in both training and testing. Ideally, I would like to find the best parameters for the confidence probability of selecting ROIs as positive or negative, in order to decrease the number of false positives given by Mask R-CNN. In addition, I would like to optimize the model further by training both the backbone ResNet101 and the RPN, Mask, and R-CNN heads. I might experiment with using the Adam optimizer instead of SGD in order to have a reactive learning rate (i.e. have the learning rate decay automatically).
In addition, I would like to further experiment with the effects of data augmentation on the performance of Mask R-CNN on this task. Specifically, I would like to experiment with the color normalization techniques described in the implementation section for U-Net (i.e. H&E stain normalization methods). Finally, I would like to investigate the effects of image transformations such as Non-Negative Matrix Factorization (or even plain PCA) on the performance of Mask R-CNN instance segmentation. As mentioned previously, Cui et al. showed that converting one image's colors into a target image's color space based on sparse non-negative matrix factorization did not improve the performance of U-Net. However, I suspect this to be due to the fact that they are already using SELU activations, which in effect perform normalization. It is possible that without SELU activations, color-normalization-based augmentation techniques might have a more significant effect on the performance of Mask R-CNN in this particular task.
6.3 Ensemble of Several Methods
As mentioned in the Data section, performing K-Means clustering on the training and test datasets showed that there are at least three clusters in our training dataset. At 3 clusters (i.e. k=3 in K-Means), K-Means produces clusters that perfectly match the 3 imaging modalities present in our dataset, even though there might be lower-level sub-clusters within each image modality as well. As future work, I would like to further explore the presence of clusters in the dataset. Once a set of clusters is identified that is present in both the training set and the test set, one might imagine fitting several different classifiers, one on each cluster. By limiting the feature space to a particular imaging modality (i.e. cluster), classifiers might show a big increase in performance. However, the ultimate success of this approach largely depends on the clusters in the training set being similar to the ones present in the test set.
Aside from training models on individual clusters, one might consider fitting several models to bootstrapped samples from the training set and combining them in an ensemble. For the methods that perform semantic segmentation (i.e. U-Net), this can be done with a late fusion method. So far, I have tried averaging several U-Net models and taking the maximum of predictions among models. Both approaches showed promise in improving the performance, but averaging the predictions showed a higher improvement. Consequently, I would like to experiment with stacking several semantic segmentation classifiers as well. One might imagine that feeding the results of several classifiers into a separate network, and allowing it to pick the best features from each prediction, could significantly improve the results.
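The late-fusion step can be sketched as follows: per-pixel probability maps from several models are combined (mean or max, the two variants tried above) and thresholded; the threshold value of 0.5 is an assumption.

```python
def fuse_predictions(prob_maps, mode="mean", threshold=0.5):
    """Late fusion of per-pixel probability maps from several semantic
    segmentation models: combine the probabilities per pixel, then
    threshold to obtain the fused binary mask."""
    h, w = len(prob_maps[0]), len(prob_maps[0][0])
    fused = []
    for y in range(h):
        row = []
        for x in range(w):
            vals = [m[y][x] for m in prob_maps]
            v = sum(vals) / len(vals) if mode == "mean" else max(vals)
            row.append(1 if v > threshold else 0)
        fused.append(row)
    return fused
```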
On the other hand, ensembling over instance segmentation methods (i.e. Mask R-CNN) is not as easily performed as in the case of semantic segmentation models, and requires a more sophisticated method such as Non-Maximum Suppression. This is because, unlike semantic segmentation methods that provide per-pixel probabilities, instance segmentation methods provide scores that are specific to each instance. I would like to experiment with applying Non-Maximum Suppression to the bounding boxes returned by my two trained Mask R-CNN models to combine their predictions while discarding redundant and overlapping bounding boxes.
7. Conclusion
To solve this problem, two broad approaches, one based on Semantic Segmentation using U-Net and another based on Instance Segmentation using Mask R-CNN, were compared. Both models were thoroughly explored, and several improvements specific to this task were suggested. Furthermore, several improvements to U-Net that were suggested by others for similar tasks were explored. Finally, in the future works section, several possible directions for improving the results were suggested, in addition to a new approach based on an ensemble of several other models.
8. Works Cited
[1]
Olaf Ronneberger, Philipp Fischer, and Thomas Brox, "U-Net: Convolutional Networks for Biomedical Image Segmentation," cs.CV, May 2015.
[2]
Kaiming He, Georgia Gkioxari, Piotr Dollar, and Ross Girshick, "Mask R-CNN," Facebook AI Research (FAIR), January 2018.
[3]
Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun, "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks," cs.CV, January 2016.
[4]
R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Region-based convolutional networks for accurate object detection and segmentation," TPAMI, 2015.
[5]
Ross Girshick, "Fast R-CNN," Microsoft Research, September 2015.
[6]
Yuxin Cui, Guiying Zhang, Zhonghao Liu, Zheng Xiong, and Jianjun Hu, "A Deep Learning Algorithm for One-step Contour Aware Nuclei Segmentation of Histopathological Images," cs.CV, March 2018.
[7]
Houlong Zhao and Nongliang Sun, "Improved U-Net Model for Nerve Segmentation," cs.CV.
[8]
Fidel A. Guerrero-Pena, Pedro D. Marrero Fernandez, Tsang Ing Ren, Mary Yui, Ellen Rothenberg, and Alexandre Cunha, "Multiclass Weighted Loss for Instance Segmentation of Cluttered Cells," cs.CV, February 2018.
9. Appendix
3.3.1 Faster R-CNN Review
The line of R-CNN models started with the original R-CNN [4] paper published in 2014, which used Selective
Search to propose possible regions of interest and a standard Convolutional Neural Network (CNN) to classify
and adjust them. It quickly evolved into Fast R-CNN [5], published in early 2015, where a technique called Region
of Interest Pooling allowed for sharing expensive computations and made the model much faster. Finally came
Faster R-CNN [3], where the first fully differentiable model was proposed.
Figure 12 shows a more detailed architecture of Faster R-CNN. The goal is to predict a list of bounding boxes, a class label for each bounding box specifying what object it contains, and a probability for each label and bounding box. As you can see, Faster R-CNN has two networks: a region proposal network (RPN) that generates region proposals, and a detection network that uses these proposals to detect objects.
Figure 14
Anchors are fixed bounding boxes that are placed throughout the image with different sizes and ratios
that are going to be used for reference when first predicting object locations.
We first pass the image through a backbone CNN architecture (like VGG or ResNet) to convert the spatial information contained in the image into content information (i.e. a feature map). Say the output of our CNN backbone is a feature map of size W*H*D. For each point in W*H, we create a set of fixed anchors. Since we only have convolutional and pooling layers, the dimensions of the feature map are proportional to those of the original image. In other words, even though anchors are defined based on the convolutional feature map, the final anchors reference the original image.
For example, if the image was W*H, the feature map will be of the size W/r * H/r with r being the subsampling
ratio of the network. If we define one anchor per spatial position of the feature map, the final image will end up
with a bunch of anchors separated by r pixels. In the case of VGG, r = 16.
In the default configuration of Faster R-CNN, there are 9 anchors at each position of the image. In order to choose the set of anchors, we usually define a set of sizes and a set of width-to-height ratios. Figure 13 shows 9 anchors at the position (320, 320) of an image of size (600, 800). The 3 colors represent three scales or sizes: 128x128, 256x256, 512x512. For each color (i.e. scale), the boxes show the 3 width:height ratios 1:1, 1:2, and 2:1 respectively.
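This anchor construction can be sketched as below; for a ratio r = w/h, width and height are chosen as scale*sqrt(r) and scale/sqrt(r) so that the box area stays scale squared. The function name is my own.

```python
import math

def anchors_at(cx, cy, scales=(128, 256, 512), ratios=(1.0, 0.5, 2.0)):
    """Generate the 9 reference anchors centred at (cx, cy): 3 scales
    times 3 width:height ratios, each as (x1, y1, x2, y2)."""
    boxes = []
    for s in scales:
        for r in ratios:
            w = s * math.sqrt(r)   # width grows with sqrt(ratio)
            h = s / math.sqrt(r)   # height shrinks so area stays s*s
            boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return boxes
```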
Figure 15
Region Proposal Network receives a set of Anchors and calculates the probability of each one belonging to
the foreground (i.e. showing an object as opposed to the background) and the probability of belonging to the
background. Next, the boundary box around each of the Anchors that have a high probability of being
foreground is adjusted using bounding box regression to better fit the object it’s predicting. The bounding box
refinement is performed by outputting 4 deltas for each Anchor showing the offsets that should be applied to its
x centre, y centre, width, and heights to make it contain the object more closely.
The RPN is implemented efficiently in a fully convolutional way, using the convolutional feature map returned by
the base network as an input. During training, we take all the anchors and put them into two different
categories. Those that overlap a ground-truth object with an Intersection over Union (IoU) bigger than 0.5 are
considered “foreground” and those that don’t overlap any ground truth object or have less than 0.1 IoU with
ground-truth objects are considered “background”.
Then, we randomly sample those anchors to form a mini batch that maintains a balanced ratio between
foreground and background anchors. The RPN uses all the anchors selected for the mini batch to calculate the
classification loss using binary cross entropy. Then, it uses only those mini batch anchors marked as foreground
to calculate the regression loss. For calculating the targets for the regression, we use the foreground anchor and
the closest ground truth object and calculate the correct deltas (centre, width, and height offsets) needed to
transform the anchor into the object.
For the refinement regression, the original paper suggests using Smooth L1 loss on the position (x, y) of the top-left corner of the box, and on the logarithm of the height and width. Smooth L1 is basically L1, but when the L1 error is small enough, as defined by a certain sigma, the error is considered almost correct and the loss diminishes at a faster rate.
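A sketch of the regression targets and the Smooth L1 loss; boxes here are given as (cx, cy, w, h), and sigma follows the common convention where the quadratic region ends at 1/sigma^2.

```python
import math

def box_deltas(anchor, gt):
    """Targets transforming an anchor (cx, cy, w, h) into the ground
    truth box: centre offsets scaled by the anchor size, plus
    log-ratios for width and height."""
    ax, ay, aw, ah = anchor
    gx, gy, gw, gh = gt
    return ((gx - ax) / aw, (gy - ay) / ah,
            math.log(gw / aw), math.log(gh / ah))

def smooth_l1(x, sigma=1.0):
    """L1 loss that becomes quadratic for errors below 1/sigma^2, so
    nearly-correct predictions are penalized less sharply."""
    beta = 1.0 / sigma ** 2
    x = abs(x)
    return 0.5 * x * x / beta if x < beta else x - 0.5 * beta
```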
The overall loss of the RPN is a combination of the classification loss and the regression loss.
Non-maximum suppression Since anchors usually overlap, proposals end up also overlapping over the same
object. Non-Maximum Suppression (NMS) is used to solve the issue of duplicate proposals. NMS takes the list of
proposals sorted by score and iterates over the sorted list, discarding those proposals that have an IoU larger
than some predefined threshold with a proposal that has a higher score. After applying NMS, we keep the top N
proposals sorted by score. In the paper, N = 2000 is used, but it is possible to lower that number to as little
as 50 and still get quite good results.
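A minimal sketch of this greedy NMS over scored proposals; boxes are (x1, y1, x2, y2) tuples, and the threshold and N defaults are just example values.

```python
def nms(proposals, iou_threshold=0.7, top_n=50):
    """Greedy Non-Maximum Suppression: walk proposals in descending
    score order, dropping any box whose IoU with an already-kept box
    exceeds the threshold, keeping at most `top_n` survivors."""
    def iou(a, b):
        x1, y1 = max(a[0], b[0]), max(a[1], b[1])
        x2, y2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
        area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
        union = area(a) + area(b) - inter
        return inter / union if union > 0 else 0.0

    kept = []
    for box, score in sorted(proposals, key=lambda p: p[1], reverse=True):
        if all(iou(box, k) <= iou_threshold for k, _ in kept):
            kept.append((box, score))
        if len(kept) == top_n:
            break
    return kept
```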
Region of Interest (ROI) Pooling At this stage, we need a feature map for each of our regions to feed to
the R-CNN for final classification (and bounding box refinement). A naive approach would be to take each
proposal, crop it, and then pass it through the pre-trained backbone CNN network to get a feature map for that
region. However, feed forwarding around 2000 proposals through a CNN is too computationally inefficient.
Faster R-CNN tries to solve this problem by reusing the existing convolutional feature map. A layer, called Region
of Interest (ROI) pooling, receives the feature map for the full image (output of backbone CNN) along with the
region proposals from RPN. For each proposed region, ROI Pooling crops the feature map and then resizes the
crop to a fixed size using interpolation (usually bilinear) and then applies a max pooling on each fixed size region.
This is very important because most networks (including the R-CNN used in the next stage) require fixed feature
sizes. Varying bounding boxes make this difficult. Unlike traditional max pooling layers, ROI Pooling splits the input feature map into a fixed number (let's say k) of roughly equal regions, and then applies max pooling on every region. Therefore, the output of ROI Pooling always has size k regardless of the size of the input.
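The fixed-output pooling can be sketched on a plain 2D grid; `feature` is a list of rows, and k is the output size per side.

```python
def roi_pool(feature, k):
    """Split a 2D feature map into a k x k grid of roughly equal bins
    and max-pool each bin, so the output is always k x k regardless of
    the input size."""
    h, w = len(feature), len(feature[0])
    out = []
    for i in range(k):
        # Bin boundaries; each bin covers at least one cell.
        y0 = i * h // k
        y1 = max((i + 1) * h // k, y0 + 1)
        row = []
        for j in range(k):
            x0 = j * w // k
            x1 = max((j + 1) * w // k, x0 + 1)
            row.append(max(feature[y][x]
                           for y in range(y0, y1) for x in range(x0, x1)))
        out.append(row)
    return out
```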
Region-based convolutional neural network (R-CNN) is the final step in Faster R-CNN’s pipeline. After getting
a convolutional feature map from the image, using it to get object proposals with the RPN and finally extracting
features for each of those proposals (via RoI Pooling), R-CNN classifies proposals into one of the classes, plus a
background class (for removing bad proposals) and more closely adjusts the bounding box for the proposal
according to the predicted class.
R-CNN tries to mimic the final stages of classification CNNs where a fully-connected layer is used to output a
score for each possible object class. As you can see in figure 14, the R-CNN takes the feature map for each
proposal, flattens it and uses two fully-connected layers of size 4096 (proposed by the original paper) with ReLU
activation. Then, it uses two different fully-connected layers to produce two different outputs:
A fully-connected layer with N+1 units where N is the total number of classes and that extra one is for the
background class.
A fully-connected layer with 4N units that performs regression on each one of the possible N classes to output
deltas (i.e. offsets) for centre, width, and height of each bounding box.
Figure 16
3.3.2 Mask R-CNN additions to Faster R-CNN
As mentioned previously, Mask R-CNN adds two features to Faster R-CNN. A mask prediction branch and a new
type of layer called ROI Align.
ROI Align Layer As described above, RoIPool is a standard operation for extracting a small feature map from each RoI. RoIPool first quantizes an RoI to the discrete granularity of the feature map; this quantized RoI is then subdivided into spatial bins which are themselves quantized, and finally the feature values covered by each bin are aggregated (usually by max pooling). Quantization is performed when feature map strides are rounded (from float to integer) and when feature maps are divided into bins. This quantization introduces misalignments between the RoI and the extracted features.
Since this misalignment has a large negative effect on predicting pixel-accurate masks, Mask R-CNN proposes an RoIAlign layer that removes the harsh quantization of RoIPool, properly aligning the extracted features with the input. To avoid any quantization of the RoI boundaries or bins, RoIAlign uses bilinear interpolation to compute the exact values of the input features at four regularly sampled locations in each RoI bin, and aggregates the results (using max or average). The results are not sensitive to the exact sampling locations, or to how many points are sampled, as long as no quantization is performed.
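The sampling primitive RoIAlign relies on can be sketched as plain bilinear interpolation; `feature` is a 2D list and (y, x) a continuous, non-negative location.

```python
def bilinear(feature, y, x):
    """Bilinearly interpolate a 2D feature map at a continuous (y, x)
    location, instead of rounding coordinates to the nearest cell as
    RoIPool does."""
    y0, x0 = int(y), int(x)
    y1 = min(y0 + 1, len(feature) - 1)
    x1 = min(x0 + 1, len(feature[0]) - 1)
    dy, dx = y - y0, x - x0
    # Blend the four neighbouring cells by their fractional distances.
    top = feature[y0][x0] * (1 - dx) + feature[y0][x1] * dx
    bottom = feature[y1][x0] * (1 - dx) + feature[y1][x1] * dx
    return top * (1 - dy) + bottom * dy
```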
Segmentation branch As shown in Figure 15, Mask R-CNN keeps everything from Faster R-CNN and adds a fully-convolutional (i.e. no fully-connected layers) branch, which produces a binary mask for each class label and for each ROI.
Figure 17
As you can see in Figure 15 above, the mask branch independently takes the output of CNN layers and runs in
parallel with the Faster R-CNN branches for class prediction and bounding box regression.
Formally, during training, a multi-task loss is defined on each sampled RoI as L = L_cls + L_box + L_mask. The classification loss and the bounding-box loss are identical to those defined in Faster R-CNN. The mask branch has a K*m^2 dimensional output for each RoI, which encodes K binary masks of resolution m*m, one for each of the K classes. A per-pixel sigmoid is then applied to each of these masks, which leads to the mask loss, defined as an average binary cross-entropy loss. For an RoI associated with ground-truth class k, the mask loss is only defined on the k-th mask (the other mask outputs do not contribute to the loss).
Notice that the above definition of the mask loss allows the network to generate masks for every class without competition among classes (i.e. the classification branch from Faster R-CNN is used to predict the class label, which in turn selects the output mask).
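This per-RoI selection can be sketched as follows; logits and targets are plain nested lists, and only the ground-truth class's mask enters the average binary cross-entropy.

```python
import math

def mask_loss(pred_masks, gt_mask, gt_class):
    """Mask R-CNN mask loss for one RoI: take only the mask predicted
    for the ground-truth class, apply a per-pixel sigmoid, and average
    the binary cross-entropy; the other K-1 masks do not contribute.

    pred_masks[k] holds the m*m logits for class k; gt_mask holds the
    m*m binary targets."""
    sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))
    logits = pred_masks[gt_class]   # only the k-th mask is penalized
    total, n = 0.0, 0
    for row_p, row_t in zip(logits, gt_mask):
        for z, t in zip(row_p, row_t):
            p = sigmoid(z)
            total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
            n += 1
    return total / n
```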
This decoupled mask and class prediction is the main difference between this approach and our previous one
using Semantic Segmentation with U-Net. U-Net takes a Fully-Convolutional Network (FCN) based approach to
semantic segmentation, which typically uses a per-pixel Softmax and a multinomial cross-entropy loss. In that
case, masks across classes compete; in the case of Instance Segmentation using Mask R-CNN, with a per-pixel
sigmoid and a binary loss, they do not.